Lab 4: Working with 'real' data

Introduction

In this lab we are going to work on how to estimate the background from 'real' data. Real is in air quotes because the data is actually from simplified simulations, to make the problems manageable in a single lab, but it will have some features that resemble those of real data sets.

Getting data and HDF5

In general exchanging raw data is a pain. Writing data in text files is error prone, inaccurate, and wastes space, but raw binary files have all sorts of subtleties too (including the internal byte order of your processor). To try and sidestep a whole set of nasty issues, we are going to use HDF5 (originally developed by the National Center for Supercomputing Applications for data exchange) to allow everyone to import the data.

The code below imports the data needed for this lab to the cloud Python server.
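
As a sketch of what that import step can look like (the file name lab4_data.h5 and the dataset name "data" here are stand-ins, not the real names — always inspect the file's keys first):

```python
import os
import tempfile
import numpy as np
import h5py

# Build a small stand-in HDF5 file so this sketch is self-contained;
# in the lab you would open the file provided by the instructor instead.
path = os.path.join(tempfile.mkdtemp(), "lab4_data.h5")
with h5py.File(path, "w") as f:
    # 4 columns: time (GPS s), solar phase (deg), Earth longitude (deg), counts
    f.create_dataset("data", data=np.zeros((1000, 4)))

with h5py.File(path, "r") as f:
    print(list(f.keys()))      # inspect which datasets the file holds
    data = f["data"][:]        # [:] reads the full dataset into a numpy array

print(data.shape)
```

Because HDF5 stores typed binary arrays with explicit byte order, the same read works regardless of which machine wrote the file.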

Problem 1

In this problem we are looking at data from a gamma-ray satellite in low Earth orbit. It records the number of particles detected every 100 milliseconds, and is in an approximately 90 minute orbit. While it is looking for gamma-ray bursts, virtually all of the particles detected are background cosmic rays.

As with most data, there are 'features.' Your lab instructor has helpfully incorporated the meta-data into your data file.

1) The data has 4 columns and more than 25 million rows. The columns are time (in gps seconds), Solar phase (deg) showing the position of the sun relative to the orbit, Earth longitude (deg) giving the position of the spacecraft relative to the ground, and particle counts.

Make a few plots, generally exploring your data and making sure you understand it. Give a high level description of the data features you see. Specifically comment on whether you see signal contamination in your data, and how you plan to build a background pdf().

First of all, let's try a scatter plot of time vs. solar phase. Because the dataset is so large, plotting every point would be time-consuming, so we take every 10,000th point from the original vector.
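
A minimal sketch of that thinning step, with simulated columns standing in for the real ones (the 0.1 s cadence comes from the problem statement; on the real array the thinning is just a stride slice like `data[::10_000]`):

```python
import numpy as np

n_rows = 25_000_000                            # rows in the real file
step = 10_000                                  # keep every 10,000th row

idx = np.arange(0, n_rows, step)
time_sub = idx * 0.1                           # GPS seconds (offset ignored)
# Toy solar phase: one 360-degree sweep per ~90 minute orbit
phase_sub = (time_sub / 5400.0 * 360.0) % 360.0

# These thinned arrays are what you would hand to plt.scatter(time_sub,
# phase_sub, s=1) — 2500 points render instantly, 25 million do not.
print(time_sub.size)   # 2500
```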

There is a clear repetitive pattern! We can zoom in on the first few cycles to look at the details.

The pattern repeats approximately every 5400 GPS seconds. Since the sampling interval is 100 ms, in other words the pattern repeats every 54,000 data points.

Now let's plot time vs. longitude. Again we look at only the first few cycles (the first 500k points, keeping every 100th).

Like the solar phase, the longitude repeats with time, with a period of about 5.4k GPS seconds. The solar phase and Earth longitude themselves show no contamination: their patterns repeat perfectly, with no disturbances or outliers.

Now let's see how the gamma-ray counts depend on time.

It is easy to tell that the gamma-ray counts also follow a repetitive pattern, from the repetition of their spikes about every 5.4k GPS seconds!

Let's see what the density of data points looks like by making a 2D histogram of counts vs. time for the first 500k points (every 100th point).
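
A sketch of that 2D histogram with `np.histogram2d`, using simulated Poisson counts as a stand-in for the real column (the bin choices here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the first 500k rows, thinned to every 100th (5000 points)
time = np.arange(0, 500_000, 100) * 0.1          # GPS seconds
counts = rng.poisson(6.0, size=time.size)        # background-like counts

# Occupancy of each (time, count) cell; one-count-wide bins on the y axis
hist, t_edges, c_edges = np.histogram2d(
    time, counts, bins=[50, np.arange(-0.5, 25.5)]
)

print(hist.shape)   # 50 time bins x 25 count bins
```

The `hist` array is exactly what `plt.imshow` or `plt.pcolormesh` would render; brighter cells mean more data points landed there.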

From the histogram, the counts cluster between 5 and 10, as indicated by the brighter region of the 2D histogram.

If we plot a 1D histogram of the count values, will we get a distribution that we already know?

Its appearance resembles one of the discrete distributions we already know: Poisson! The peak of our data is at 6, so let's compare how similar it is to a true Poisson($\lambda = 6$) distribution.
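
A sketch of that comparison, computing the pmf $P(k) = e^{-\lambda}\lambda^k/k!$ directly. The counts here are simulated as truly Poisson, so, unlike the real data, their tail will agree with the pmf — that agreement is exactly what the real data fails to show:

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)
counts = rng.poisson(6.0, size=1_000_000)   # stand-in for the raw counts

# Empirical frequency of each count value 0..20
k = np.arange(0, 21)
empirical = np.array([(counts == ki).mean() for ki in k])

# Poisson pmf with lambda = 6 for comparison
lam = 6.0
pmf = np.array([exp(-lam) * lam**int(ki) / factorial(int(ki)) for ki in k])

for ki in (3, 6, 9):
    print(ki, empirical[ki], pmf[ki])
```

Plotting `empirical` and `pmf` on a log scale is the most sensitive comparison: discrepancies in the rare, high-count tail are invisible on a linear plot.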

Comparing the two histograms and their logarithmic plots, they are similar (although not identical) at smaller x values, but the tails of the two distributions diverge, with a larger difference that is most obvious on the log plot. This suggests that the gamma-ray count data may be contaminated: the raw data definitely does not follow a single Poisson distribution.

As mentioned previously, the solar phase and Earth longitude show the same kind of periodic pattern as our count data. What if the contamination is tied to particular solar phases or longitudes? If we can figure out how the counts are related to these two variables, we can figure out our background pdf.

2) The background is not consistent across the dataset. Find and describe as accurately as you can how the background changes.

Let's put the solar phase, Earth longitude and the gamma-ray counts on one graph as functions of time, to see whether their repetitions are related to each other.

Looking at the graph, the solar phase is slightly out of phase and has a different period from the raw gamma-ray data, but the longitude repeats with the same period as the gamma-ray data. We can conclude that the solar phase is not directly related to the pattern in the gamma-ray counts, but the longitude is.

Another feature this graph reveals is that every time the longitude starts decreasing rapidly within a cycle, a huge spike occurs in the number of gamma rays detected. This could be the source of our signal contamination: it explains why the raw count distribution has a longer tail than the true Poisson distribution. If we exclude the regions where those abnormal spikes occur, we should obtain a more accurate background distribution for the gamma-ray counts.

To find the time dependence, we can fold our data by the period (~5.4k GPS seconds, or 54k data points), giving us the distribution of the gamma-ray counts as a function of time within one period.
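
The folding step can be sketched as follows, with simulated counts standing in for the real column (the decaying rate inside each period is an assumed toy model):

```python
import numpy as np

period = 54_000                  # samples per orbit (5400 s at 0.1 s cadence)
rng = np.random.default_rng(2)

# Stand-in counts: 5 full periods, with a rate that decays inside each period
phase_idx = np.arange(5 * period) % period
lam = 4.0 + 4.0 * np.exp(-phase_idx / 20_000.0)
counts = rng.poisson(lam)

# Fold: one row per orbit, one column per position within the period
folded = counts.reshape(-1, period)
profile = folded.mean(axis=0)    # mean counts at each phase of the orbit

print(folded.shape)   # (5, 54000)
```

Because the period is an exact number of samples here, a `reshape` does the folding; for a non-integer period you would index by `time % period` instead.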

Let's look at the distribution at a specific time within one period.

Some discrepancy with the true Poisson still exists, but the distribution of our data is now, for the most part, much closer to a true Poisson distribution.

Now that we know the distribution at each time within one period is Poisson, the only thing that can change with time is the mean of the distribution. We can treat the mean at each time as a parameter and see how it depends on time within one period. We choose 2700 as our starting point because that is where a full pattern starts, running from the spike to the end.

There is an obvious trend in the average number of gamma-ray counts: a steady decline. Let's assume it follows a decaying exponential and apply an exponential regression to our data.
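
One way to do the exponential regression is a linear fit in log space, since $\bar{n}(t) = A e^{-t/\tau}$ implies $\ln \bar{n} = \ln A - t/\tau$. A sketch with a simulated folded profile (the amplitude and decay constant here are made up, not the fitted lab values):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the folded per-phase mean counts over one period
t = np.arange(0, 5400.0, 10.0)                  # seconds into the period
true_A, true_tau = 8.0, 2000.0
mean_counts = true_A * np.exp(-t / true_tau) * rng.normal(1.0, 0.01, t.size)

# Exponential regression = straight-line fit to the log of the means
slope, intercept = np.polyfit(t, np.log(mean_counts), 1)
A_fit, tau_fit = np.exp(intercept), -1.0 / slope

print(A_fit, tau_fit)
```

A log-space fit weights fractional errors equally at all count levels; `scipy.optimize.curve_fit` on the exponential directly is a reasonable alternative when the low-count bins are noisy.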

From this we get a function describing how the average number of detected gamma-ray counts changes with time within one period!

3) Create a model for the background that includes time dependence, and explicitly compare your model to the data. How good is your model of the background?

In part 1 we concluded that the data at a given value of time (folded by the period) approximately follows a Poisson distribution, and in part 2 we found that the average value at a given time decreases exponentially within one period. Combining these two results gives our time-dependent model for one period: the background follows a Poisson distribution whose mean is given by the exponential function of time from part 2.
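
A sketch of the combined model (the amplitude and decay constant are placeholders for the values fitted in part 2):

```python
from math import lgamma
import numpy as np

def background_lambda(t, A=8.0, tau=2000.0):
    # Part 2: the mean counts decay exponentially through the period
    return A * np.exp(-t / tau)

def background_pmf(k, t):
    # Part 1: at fixed phase t the counts are Poisson with that mean;
    # evaluate the pmf in log space so large k cannot overflow k!
    lam = background_lambda(t)
    k = np.asarray(k, dtype=float)
    log_pmf = -lam + k * np.log(lam) - np.array([lgamma(ki + 1.0) for ki in k])
    return np.exp(log_pmf)

# Sanity check: at any phase the pmf should sum to ~1 over all counts
k = np.arange(60)
p = background_pmf(k, t=1000.0)
print(p.sum())
```

Comparing `background_pmf(k, t)` against a histogram of the folded data at phase `t` is exactly the model-vs-data check part 3 asks for.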

Let's try a few values of time to check how good our model fits the real data.

The values of time we choose to test are 20000, 24000, 28000, 32000, 36000 and 40000. (As mentioned before, there is signal contamination at the beginning of each period, so we start our tests from the middle of the period.)

Some discrepancy still exists, especially in the tails of the plots, but our model fits the real data well over most of the range.

4) Because the background varies, your discovery sensitivity threshold (how many particles you would need to see) also varies. What is the '5-sigma' threshold for a 100 millisecond GRB at different times?

Again, we can use the same set values of time to test their corresponding '5-sigma' threshold.
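
A sketch of the threshold computation: the one-sided 5-sigma false-alarm probability is $p \approx 2.87 \times 10^{-7}$, and the threshold is the smallest count whose Poisson tail probability falls below it (the example means here are placeholders, not the fitted values):

```python
from scipy.stats import norm, poisson

p_5sigma = norm.sf(5.0)        # one-sided 5-sigma tail probability, ~2.87e-7

def discovery_threshold(lam):
    """Smallest k with P(X >= k | Poisson(lam)) below the 5-sigma tail."""
    k = int(lam)
    while poisson.sf(k - 1, lam) >= p_5sigma:   # sf(k-1) = P(X >= k)
        k += 1
    return k

for lam in (8.0, 6.0, 4.5):
    print(lam, discovery_threshold(lam))
```

Because the background mean decays through the period, the threshold decays with it: a burst that is a clear detection late in the period could hide under the higher background just after the spike.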

The 5-sigma threshold shifts from 25 down to 22 over one period for the points we tested.

Problem 2

In this problem we are going to look at a stack of telescope images (again simulated). We have 10 images, but you and your lab partner will be looking for different signals. One of you will be looking for the faintest stars, while the other will be looking for a transient (something like a supernova that only appears in one image). Flip a coin to determine which of you is pursuing which question.

In this lab, we choose to look for a transient star.

1) Download the data from images.h5. This is a stack of 10 square images, each 200 pixels on a side.

2) Explore the data. Is there signal contamination? Is the background time dependent? Is it consistent spatially? Develop a plan to calculate your background pdf()

Let's first see what one image looks like:

We then plot all the images in our image stack.

To the eye, there is no obvious change between any pair of images, so the background is very likely time-independent. For further confirmation, let's compute the average intensity over all pixels in each image and compare them.

There are tiny fluctuations between images, but the general trend shows the background is time-independent.

Since we are looking for a transient star that appears in only one image, we can subtract the previous image from the one we are interested in. What we are left with will simply be background noise plus any potential transient stars (persistent stars are eliminated).
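
The differencing step can be sketched like this, with a simulated stack standing in for images.h5 (the star positions and brightnesses are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in stack: 10 images, 200x200 pixels, with two persistent "stars"
# on a noisy background and one transient that appears only in image 5
stack = rng.normal(10.0, 1.0, size=(10, 200, 200))
stack[:, 50, 80] += 30.0          # a bright star, present in every frame
stack[:, 120, 40] += 12.0         # a fainter persistent star
stack[5, 150, 150] += 25.0        # the transient (frame 5 only)

# Subtracting consecutive images removes anything persistent;
# only noise and the transient survive
diffs = np.diff(stack, axis=0)    # shape (9, 200, 200)

print(diffs.shape)
```

Note the transient leaves a characteristic signature: a positive spike in the difference where it appears and a negative spike in the next difference where it vanishes.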

The difference between consecutive images seems to follow a Gaussian distribution! If the log plots look parabolic, we can be more confident that the distributions are Gaussian.

They are quite close to parabolic, so we can safely conclude that the PDF is Gaussian. However, there are a few outliers that are only visible on the log plots. Those outliers represent large intensity differences, which may indicate the presence of transient stars (high brightness)!

The mean of the distribution is clearly located around 0, and the standard deviation can be found by computing the standard deviation of our data set.
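
A sketch of that estimate, using a simulated difference image (each simulated image is given unit Gaussian noise, so the difference should have standard deviation $\sqrt{2}$):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for one consecutive-image difference: pure background noise
diff = rng.normal(10.0, 1.0, (200, 200)) - rng.normal(10.0, 1.0, (200, 200))

mu = diff.mean()        # expected ~0: the backgrounds cancel
sigma = diff.std()      # ~sqrt(2) when each image has unit noise

# Gaussian pdf with the estimated parameters, for overlaying on the histogram
x = np.linspace(-6, 6, 200)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(mu, sigma)
```

The variance-doubling is worth remembering: differencing two images each with noise $\sigma_0$ gives noise $\sqrt{2}\,\sigma_0$, so the threshold must be set on the difference's width, not the single image's.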

Let's see how well this distribution fits the difference between images 2 and 1.

3) Using your background distribution, hunt for your signal (either faint stars, or a transient). Describe what you find.

From part 2, the background distribution is Gaussian. Now we can hunt for our transient signal based on this distribution: we set a threshold, and any pixel whose intensity difference exceeds that threshold is classified as a transient signal point.

Let's try 5$\sigma$ as our threshold first

Our threshold is around 4 for all the images

Now we look for the number of data points having a value over the threshold.
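
A sketch of the thresholding, with a simulated difference image and one injected transient (the 10-sigma amplitude is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in difference image: Gaussian noise with sigma = sqrt(2), as when
# two unit-noise images are subtracted, plus one injected transient
sigma_true = np.sqrt(2.0)
diff = rng.normal(0.0, sigma_true, (200, 200))
diff[150, 150] += 10.0 * sigma_true      # injected 10-sigma transient

sigma = diff.std()                       # estimate sigma from the data itself
for n in (5, 4):
    hits = np.argwhere(diff > n * sigma)
    print(f"{n}-sigma threshold: {len(hits)} pixel(s) above it")
```

Lowering the threshold raises the false-alarm rate: for a 200x200 image the expected number of background pixels above 4$\sigma$ is about $40000 \times 3.2\times10^{-5} \approx 1.3$, versus about 0.01 above 5$\sigma$, so some 4$\sigma$ "detections" may just be noise.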

Looks like 5$\sigma$ is too high a threshold in this case, as it excludes all the data points.

Let's try 4$\sigma$ as our threshold.

Using 4$\sigma$ as the threshold, we get 2+1+5+2 = 10 data points that can be classified as transient stars!

4) You and your lab partner had different pdf(), but were using the same data. Explore why this is.

Although we are working with the same data set, our goals are different and call for different ways of treating the data. My partner, who is interested in finding the faintest stars, might average all the images together to bring out the dim pixels. In my case, we look at the difference between each consecutive pair of images. These two methods will naturally give us different pdfs.